ABSTRACT

We propose to demonstrate LiquidXML, a platform for man- 
aging large corpora of XML documents in large-scale P2P 
networks. All LiquidXML peers may publish XML documents
to be shared with all the network peers. The challenge
then is to efficiently (re-)distribute the published content
in the network, possibly in overlapping, redundant fragments,
to support efficient processing of queries at each peer.
The novelty of LiquidXML lies in its adaptive method of
choosing which data fragments are stored where, to improve 
performance. The "liquid" aspect of XML management is 
twofold: XML data flows from many sources towards many 
consumers, and its distribution in the network continuously 
adapts to improve query performance. 
H.2.4 [Database Management]: Systems — Distributed
1. INTRODUCTION

We consider the problem of building large-scale, decen- 
tralized XML stores, capable of efficiently evaluating XML 
queries over documents indexed in a DHT-based peer-to-peer
network. Our solution is based on ViP2P (standing for Views in
Peer-to-Peer), a platform which we developed previously [4].
In ViP2P, any peer may publish XML documents, which it is
willing to share with the other peers.
Moreover, any peer may establish long-running subscrip- 
tions to XML content published anywhere in the network, 
that matches a given subscription query. The results of such 
subscriptions are stored at the subscriber peer, and adver- 
tised in the DHT network, so that other peers may re-use 
them to answer their own queries, with less computation 
effort. Conceptually, the result of each subscription
can thus be seen as a materialized view, based on which
subsequent queries can be rewritten. It is important to note that:
(i) the queries defining the subscriptions (and not the subscription
results) are indexed in the DHT network, leading
to a small overhead of data sharing; and (ii) we consider a
collaborative scenario, where each peer is willing to share its
subscriptions/views with any other. ViP2P has been shown
to scale up to 500 peers and 100 GB of XML data. A
separate development on top of ViP2P, illustrating P2P document
annotations, was demonstrated in [3].

Figure 1: LiquidXML platform architecture.

Our proposed demo features LiquidXML, a system built 
on top of ViP2P. Its main technical innovation is to auto- 
matically select and continuously adapt the set of material- 
ized views on each peer, to improve query processing per- 
formance both for the view holding peer, and for the other 
network peers. LiquidXML continuously adapts by adding 
more materialized views and/or replacing low-utility views
with more useful ones according to the query workload.
Figure 1 outlines LiquidXML's architecture, on top of ViP2P.
The modules shown in thick, white boxes are novel to
LiquidXML and are the main focus of the demo.

2. LIQUIDXML PLATFORM OUTLINE 

The main aspects of the LiquidXML content management 
platform can be summarized as follows. 
Peer space budget. When joining the network, each peer
declares a space budget that it can spend to store data structures
aimed at improving query performance for itself and
for the other network peers. Upon joining, the space budget
of the peer is unused (empty).

Figure 2: Demonstration screenshots: peer network and sample query rewriting (left); sample view data and
simple physical plan (center); LiquidXML's adaptation monitoring window (right).

Document-level indexes. LiquidXML builds document-level
indexes, distributed in the network. For each term
(element or attribute name, or word) appearing in an XML
document, the URIs of all the network documents featuring
that term are stored by some peer. This index allows locating
all documents which may contain answers to a given
query. Sending the query to the corresponding peers leads
to obtaining the results. This query answering mechanism is
not always efficient, since (i) some documents may not
lead to answers, even if they contain all the query terms,
and (ii) queries are always evaluated from scratch.
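
The document-level lookup described above can be illustrated with a small sketch. It is not LiquidXML's code: the class `DocumentIndex` and its methods are hypothetical names, and a single in-memory dictionary stands in for the index that is actually distributed over the DHT.

```python
from collections import defaultdict


class DocumentIndex:
    """Toy stand-in for the distributed term index: term -> document URIs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def publish(self, uri, terms):
        # Register the document URI under every term it contains.
        for term in terms:
            self._postings[term].add(uri)

    def candidates(self, query_terms):
        # Documents containing all query terms; some may still yield no answer,
        # since term co-occurrence does not guarantee a structural match.
        terms = list(query_terms)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = DocumentIndex()
index.publish("doc1.xml", {"book", "title", "author"})
index.publish("doc2.xml", {"book", "title"})
index.publish("doc3.xml", {"article", "author"})
hits = index.candidates(["book", "author"])  # only doc1.xml has both terms
```

The query would then be shipped to the peers holding the documents in `hits`; in this sketch nothing is cached, which mirrors point (ii) above.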
Query statistics at a peer. Each peer p is aware of a query
set $Q_p$, which is a subset of all the queries being asked in the
network. This set contains the queries asked by p, and the
queries which p helped answer based on the data stored
at p. Peer p collects, for each query $q \in Q_p$, its frequency
$(\#q)$ in a given time window of length r.
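
One possible way to maintain such per-query frequencies is a sliding-window counter. The sketch below is an assumption on our part, not the mechanism used by LiquidXML: the class `QueryStatistics` is hypothetical, timestamps are plain numbers supplied by the caller, and they are assumed to be non-decreasing.

```python
from collections import Counter, deque


class QueryStatistics:
    """Counts how often each query was seen within the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self._events = deque()    # (timestamp, query), oldest first
        self._counts = Counter()  # query -> occurrences inside the window

    def record(self, query, now):
        self._events.append((now, query))
        self._counts[query] += 1
        self._expire(now)

    def _expire(self, now):
        # Drop events that fell out of the window (now - window, now].
        while self._events and self._events[0][0] <= now - self.window:
            _, old = self._events.popleft()
            self._counts[old] -= 1
            if self._counts[old] == 0:
                del self._counts[old]

    def frequency(self, query, now):
        self._expire(now)
        return self._counts.get(query, 0)


stats = QueryStatistics(window=10)
stats.record("//book/title", now=0)
stats.record("//book/title", now=4)
stats.record("//author", now=5)
```

With these three events, `stats.frequency("//book/title", now=9)` counts both occurrences, while at `now=12` the first one has left the window.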
Peer candidate views. Based on its query statistics, each
peer p identifies a set of XML views to be materialized and
shared with other peers that may need them, in order to
reduce the response time of the queries of p and of other peers. The
general problem of finding all candidate materialized views
for a given XML query workload is very complex [5]. In
LiquidXML, we use a data cube-style [2] lattice to identify
the most appropriate candidates. For each candidate
view, p computes: (i) a cost estimation (in terms of size)
and (ii) the benefit (in terms of network and computation
savings) that the candidate view would bring to the system
if it were materialized.

View size estimation. For each document published by
peer p, a compact document synopsis is also indexed in the
DHT. The document synopsis is based on XSum [1], our
own Dataguide implementation, and is used to estimate the
contribution (in size) of a document to a candidate view. To
estimate the size of a candidate view, the synopses of all documents
that may contribute to the view are retrieved. The
total estimated size of a candidate view v, denoted $size(v)_e$,
is the sum of all document contributions to the view.
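
The summation can be sketched as follows. The synopsis format here is purely an assumption made for illustration (a dictionary from a root-to-node path to the estimated number of bytes stored under that path); it is not the XSum format, and `estimate_view_size` is a hypothetical helper.

```python
def estimate_view_size(view_paths, synopses):
    """Sum, over all retrieved synopses, the estimated bytes under the view's paths."""
    total = 0
    for synopsis in synopses:
        # Contribution of one document: bytes under each path the view stores.
        total += sum(synopsis.get(path, 0) for path in view_paths)
    return total


# Hypothetical synopses of two documents that may contribute to the view.
synopses = [
    {"/lib/book/title": 1200, "/lib/book/author": 800},
    {"/lib/book/title": 300},
]
title_only = estimate_view_size(["/lib/book/title"], synopses)
title_and_author = estimate_view_size(
    ["/lib/book/title", "/lib/book/author"], synopses
)
```

A document whose synopsis lacks a path simply contributes nothing for that path.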
Query cost estimation. The presence of a new materialized
view may change the way a query is processed, if the query
can be rewritten based on that view. We assign to each
rewriting a cost estimation, reflecting the amount of data
transmitted between peers to evaluate the rewriting. Given
a set of materialized views V and a query q, we denote the
estimated cost of answering q as $cost(q, V)_e$.
View benefit estimation. Given a candidate view v, the
total set V of views currently materialized in the network,
and a query workload Q, we estimate the benefit of v for Q
with respect to V as:

$$b(v, Q, V) = \sum_{q \in Q} (\#q) \times \left( cost(q, V)_e - cost(q, V \cup \{v\})_e \right)$$
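
To make the benefit estimate concrete, the sketch below evaluates the frequency-weighted cost difference on a toy workload. The function `benefit`, the cost model `toy_cost` and the `covers` table are all hypothetical: the real cost estimate reflects data transmitted between peers, whereas here a query simply costs 10 if some materialized view can answer it and 100 otherwise.

```python
def benefit(view, workload, views, cost):
    """Frequency-weighted cost saving obtained by adding `view` to `views`.

    workload: dict mapping each query to its frequency in the time window.
    views:    frozenset of views currently materialized.
    cost:     function (query, views) -> estimated cost of answering the query.
    """
    extended = views | {view}
    return sum(
        freq * (cost(q, views) - cost(q, extended))
        for q, freq in workload.items()
    )


# Assumed toy data: which queries each view can be used to rewrite.
covers = {"v1": {"q1"}, "v2": {"q1", "q2"}}


def toy_cost(query, views):
    # Cheap if some materialized view covers the query, expensive otherwise.
    return 10 if any(query in covers[v] for v in views) else 100


workload = {"q1": 3, "q2": 1}
b_v1 = benefit("v1", workload, frozenset(), toy_cost)
b_v2 = benefit("v2", workload, frozenset(), toy_cost)
b_v1_after_v2 = benefit("v1", workload, frozenset({"v2"}), toy_cost)
```

In this toy setting, v1 brings no further saving once v2 is materialized, which illustrates why the benefit is defined with respect to the current set of views V.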

Putting it all together: LiquidXML adaptation. Each
LiquidXML peer continuously gathers statistics and costs
as outlined above. At regular intervals of length r, each peer
enumerates candidate views and materializes those maximizing the
benefit-to-size ratio, up to the limit of its space budget. Existing
views with a low benefit-to-size ratio can be dropped
to make room for more beneficial ones.
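
One simple way to realize such a selection is a greedy pass over the views ranked by benefit-to-size ratio. This is a sketch under our own assumptions, not necessarily the strategy implemented in LiquidXML: `select_views` is a hypothetical function, benefits are taken as fixed numbers rather than re-estimated after each choice, and existing and candidate views are ranked together so that low-ratio existing views are dropped implicitly.

```python
def select_views(candidates, budget):
    """Greedily keep the views with the best benefit-to-size ratio within `budget`.

    candidates: list of (name, estimated_benefit, estimated_size) tuples.
    """
    useful = [c for c in candidates if c[1] > 0 and c[2] > 0]
    ranked = sorted(useful, key=lambda c: c[1] / c[2], reverse=True)
    chosen, used = [], 0
    for name, _benefit, size in ranked:
        if used + size <= budget:  # skip views that no longer fit
            chosen.append(name)
            used += size
    return chosen


# Hypothetical estimates: (view, benefit, size).
candidates = [("v1", 270, 90), ("v2", 360, 200), ("v3", 50, 10)]
kept = select_views(candidates, budget=120)
```

Here v3 and v1 have the highest ratios (5 and 3) and fit together in the budget of 120, while v2 (ratio 1.8, size 200) is left out.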

3. IMPLEMENTATION AND SCENARIO 

LiquidXML is implemented in Java, on top of ViP2P, us- 
ing the FreePastry DHT as the underlying P2P network and 
the BerkeleyDB library to store materialized views. The 
supported query language is a core subset of XQuery, con- 
sisting of conjunctive tree patterns with joins. 

LiquidXML's GUI will enable demo attendees to: (i) connect
to any peer and inspect its views, queries, and statistics;
(ii) control adaptation parameters, e.g., synopsis size,
the adaptation time window, etc.; (iii) view the evolution of
the peers' views over time; and (iv) view the logical plans resulting
from rewriting and the resulting distributed physical query
plans. Figure 2 shows some sample screenshots.

We will show the demo on 250 machines of the Grid5000 
network (http://www.grid5000.fr). We will trace query ex- 
ecution and performance in three scenarios: 

1. Document-level indexes: Only document-level indexes 
will be used to locate the documents potentially containing 
query results, to which the query is shipped. 

2. User-defined views: Users may manually define specific 
views to materialize. Queries will be answered by rewriting 
them in terms of the user-defined views. 

3. Full adaptive LiquidXML: Peers automatically adjust
their views to match the needs of the distributed query
workload. Demo attendees will visualize the set of views
on each peer, as it varies over time.

More information about LiquidXML can be found in our technical report.